What grammars tell us about corpora: the case of reduced relative clauses
Abstract
We present a large (65 million words of Wall Street Journal) and in-depth corpus study of a particular syntactic ambiguity to investigate (1) to what extent the structure of a grammar is reflected in a corpus, and (2) how probability functions defined according to a grammar fit independently established measures of syntactic disambiguation preference. We look at the well-known case of the ambiguity between a main clause and a reduced relative construction. We measure the probability distributions of several linguistic features (transitivity, tense, voice) over a sample of optionally intransitive verbs. In agreement with recent results on parsing with lexicalised probabilistic grammars (Collins, 1997; Srinivas, 1997), we find that statistics over lexical, as opposed to structural, features best correspond to human intuitive judgments and to experimental findings. These results shed light on novel uses of corpora, by assessing the portability of statistics across tasks and by determining what is needed for useful syntactic annotation of corpora.
1 Introduction
Most linguistic work until the 1950s studied language use, which required attention to detail and exceptions, and led to the development of data-driven theories and to the use of corpora to model naturally occurring language. Later on, linguists mostly studied grammars, which focussed on generalities and regularities, and led to the formulation of strong theories and to the study of similarity across languages. Some of the current "empirical" approaches integrate the corpus-based lessons with the depth of insight that the study of grammar has brought to the study of language. Empirically-induced models that learn a linguistically meaningful grammar (Collins, 1997) seem to give the best practical results in statistical natural language processing.
One of the reasons why these models perform so well compared to probabilistic context-free grammars is that they incorporate detailed lexical knowledge at all points in the derivation (Charniak, 1997). At the same time they perform better than string-based approaches because they retain structural knowledge, such as phrase structure, subcategorization and long-distance dependencies. So they are equally capable of modelling the fine lexical idiosyncrasies and the more general syntactic regularities. Given an annotated training corpus, such methods learn its distributions (the lexical co-occurrences), which requires being given the correct space of events in the model (that is, the grammar) accurately enough that they can parse new instances of the same corpus. The success of such models suggests that a statistical model must have access to the appropriate linguistic features to make …
Similar resources
What grammars tell us about corpora: the case of reduced relative clauses
We present a large (65 million words of Wall Street Journal) and in-depth corpus study of a particular syntactic ambiguity to investigate (1) to what extent the structure of a grammar is reflected in a corpus, and (2) how probability functions defined according to a grammar fit independently established measures of syntactic disambiguation preference. We look at the well-known case of the ambiguity...
The comprehension of Relative Clauses within Movement-based hypothesis in L1 and L2: the Case of Tati and English
An HPSG-Analysis for Free Relative Clauses in German
At the moment there is no theory for free relative clauses in German in the framework of Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994). From GB literature on the subject it is known that free relative clauses behave partly like noun phrases. They can fill argument positions of verbs. And although they are finite sentences, they are serialized like noun phrases in the Germa...
Sentence Processing Among Native vs. Nonnative Speakers: Implications for Critical Period Hypothesis
The present study intended to investigate the processing behavior of 2 groups of L2 learners of English (high and mid in proficiency) and a group of English native speakers on English active and passive reduced relative clauses. Three sets of tasks, an offline task, and 2 online tasks were conducted. Results revealed that the high-proficiency group’s performance was the same as that of the nati...
Extracting Syntax Statistics from Large Corpora of Written English
The field of linguistics has seen a growing interest in the statistics of everyday language. In studying how we acquire language and why some of its aspects are more difficult for us than others, it is critical to understand the linguistic environment to which we are exposed. However, gathering statistics over syntactic structures, even with a syntactically tagged corpus, can be difficult and t...
Publication date: 1998